Exploratory Data Analysis
By: Chinmay Jain (ST1128), Manisha Mehta (ST1147), Sarthak Priyank Verma (ST1170), Issac (ST1137)
Data Set: IMDB Movie Rating
Source: https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset
Background: A commercially successful movie not only entertains audiences but also earns film companies substantial profit. Many factors, such as good directors and experienced actors, contribute to making a good movie. However, while famous directors and actors can usually deliver the expected box-office income, they cannot guarantee a high IMDb score.
Problem statement: A renowned film production company has partnered with Mu Sigma to gain a comprehensive understanding of the key factors that contribute to the success of movies. The primary goal is to leverage an in-depth analysis of IMDb ratings to identify patterns, insights, and actionable recommendations that can enhance the director’s future film projects and overall success in the highly competitive film industry.
Expected Outcomes: Identifying the key factors that influence IMDb rating. This can be done by performing correlation analysis on the IMDb dataset to identify the factors that are most strongly correlated with IMDb rating.
Understanding the relationships between different factors and IMDb rating. Mu Sigma can use analysis to model the relationship between IMDb rating and other movie features. This can help to understand how different factors impact IMDb rating, and to identify the factors that are most important for predicting the success of a movie.
Segmenting movies into different groups based on their features. This can help to identify different types of movies and to understand the characteristics of each type of movie. This information can be used to develop more targeted marketing and distribution strategies for different types of movies.
Target Variables: IMDb Score, Profit, Facebook Likes
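Profit is listed as a target variable but is not one of the dataset's columns, so it has to be derived. The sketch below assumes profit = gross - budget and adds a return-on-investment ratio as a scale-free companion. Both definitions, and the toy rows, are assumptions made for illustration; the sketch also assumes gross and budget are expressed in the same currency.

```r
# Toy rows for illustration only; the values are not taken from the dataset.
movies <- data.frame(
  movie_title = c("Film A", "Film B", "Film C"),
  gross  = c(150e6, 20e6, NA),
  budget = c(100e6, 35e6, 10e6),
  stringsAsFactors = FALSE
)

# Assumed definition: profit = gross - budget (not a column in the source data).
movies$profit <- movies$gross - movies$budget

# Assumed companion measure: return on investment, which is unit-free.
movies$roi <- movies$profit / movies$budget

# A missing gross or budget propagates to NA in both derived columns.
print(movies)
```

Because NA propagates through the subtraction, rows with a missing gross or budget cannot contribute to any profit analysis.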
Column descriptions (metadata)
color: This column indicates whether the movie is in color (e.g., “Color” or “Black and White”).
director_name: The name of the movie’s director, the person responsible for overseeing the creative aspects of the film.
num_critic_for_reviews: The number of critic reviews that a movie has received, which can provide insight into its critical reception.
duration: The duration of the movie in minutes, indicating its length.
director_facebook_likes: The number of Facebook likes for the movie’s director, indicating their social media popularity.
actor_3_facebook_likes: The number of Facebook likes for the third-billed actor in the movie’s cast, indicating their popularity.
actor_2_name: The name of the second-billed actor in the movie’s cast.
actor_1_facebook_likes: The number of Facebook likes for the first-billed actor in the movie’s cast.
genres: The genres or categories that the movie belongs to (e.g., “Action,” “Comedy,” “Drama,” etc.).
actor_1_name: The name of the first-billed actor in the movie’s cast.
movie_title: The title of the movie.
num_voted_users: The number of users who voted or rated the movie, which can reflect its popularity.
cast_total_facebook_likes: The total number of Facebook likes for the movie’s entire cast.
actor_3_name: The name of the third-billed actor in the movie’s cast.
facenumber_in_poster: The number of faces on the movie poster which may or may not be relevant to the movie’s success.
plot_keywords: Keywords or phrases describing the movie’s plot, themes, or content.
movie_imdb_link: A link to the movie’s IMDb page for additional information.
num_user_for_reviews: The number of user reviews for the movie, which can provide insight into its audience reception.
language: The primary language in which the movie is spoken or produced.
country: The country of origin for the movie.
content_rating: The content rating assigned to the movie, such as “PG-13,” “R,” “G,” etc.
budget: The budget of the movie, i.e. the money spent on making it.
title_year: The year in which the movie was released.
actor_2_facebook_likes: The number of Facebook likes for the second-billed actor in the movie’s cast.
imdb_score: The IMDb rating score reflecting the movie’s overall quality as rated by users.
aspect_ratio: The aspect ratio in which the movie is displayed (e.g., 16:9, 2.35:1).
movie_facebook_likes: The number of Facebook likes for the movie’s official Facebook page.
gross: The total gross revenue generated by the movie, indicating its financial success.
Importing relevant libraries
library(dplyr)
library(rvest)
library(plotly)
library(tidyr)
library(ggplot2)
library(stringr)
library(corrplot)
Reading the CSV file containing IMDb movie metadata into a data frame named IMDB
IMDB <- read.csv("C:\\Users\\chinmay.Jain\\Desktop\\R\\movie.csv")
Summary statistics for the IMDB data frame
summary(IMDB)
## color director_name num_critic_for_reviews duration
## Length:5043 Length:5043 Min. : 1.0 Min. : 7.0
## Class :character Class :character 1st Qu.: 50.0 1st Qu.: 93.0
## Mode :character Mode :character Median :110.0 Median :103.0
## Mean :140.2 Mean :107.2
## 3rd Qu.:195.0 3rd Qu.:118.0
## Max. :813.0 Max. :511.0
## NA's :50 NA's :15
## director_facebook_likes actor_3_facebook_likes actor_2_name
## Min. : 0.0 Min. : 0.0 Length:5043
## 1st Qu.: 7.0 1st Qu.: 133.0 Class :character
## Median : 49.0 Median : 371.5 Mode :character
## Mean : 686.5 Mean : 645.0
## 3rd Qu.: 194.5 3rd Qu.: 636.0
## Max. :23000.0 Max. :23000.0
## NA's :104 NA's :23
## actor_1_facebook_likes gross genres
## Min. : 0 Min. : 162 Length:5043
## 1st Qu.: 614 1st Qu.: 5340988 Class :character
## Median : 988 Median : 25517500 Mode :character
## Mean : 6560 Mean : 48468408
## 3rd Qu.: 11000 3rd Qu.: 62309438
## Max. :640000 Max. :760505847
## NA's :7 NA's :884
## actor_1_name movie_title num_voted_users
## Length:5043 Length:5043 Min. : 5
## Class :character Class :character 1st Qu.: 8594
## Mode :character Mode :character Median : 34359
## Mean : 83668
## 3rd Qu.: 96309
## Max. :1689764
##
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## Min. : 0 Length:5043 Min. : 0.000
## 1st Qu.: 1411 Class :character 1st Qu.: 0.000
## Median : 3090 Mode :character Median : 1.000
## Mean : 9699 Mean : 1.371
## 3rd Qu.: 13756 3rd Qu.: 2.000
## Max. :656730 Max. :43.000
## NA's :13
## plot_keywords movie_imdb_link num_user_for_reviews language
## Length:5043 Length:5043 Min. : 1.0 Length:5043
## Class :character Class :character 1st Qu.: 65.0 Class :character
## Mode :character Mode :character Median : 156.0 Mode :character
## Mean : 272.8
## 3rd Qu.: 326.0
## Max. :5060.0
## NA's :21
## country content_rating budget title_year
## Length:5043 Length:5043 Min. :2.180e+02 Min. :1916
## Class :character Class :character 1st Qu.:6.000e+06 1st Qu.:1999
## Mode :character Mode :character Median :2.000e+07 Median :2005
## Mean :3.975e+07 Mean :2002
## 3rd Qu.:4.500e+07 3rd Qu.:2011
## Max. :1.222e+10 Max. :2016
## NA's :492 NA's :108
## actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
## Min. : 0 Min. :1.600 Min. : 1.18 Min. : 0
## 1st Qu.: 281 1st Qu.:5.800 1st Qu.: 1.85 1st Qu.: 0
## Median : 595 Median :6.600 Median : 2.35 Median : 166
## Mean : 1652 Mean :6.442 Mean : 2.22 Mean : 7526
## 3rd Qu.: 918 3rd Qu.:7.200 3rd Qu.: 2.35 3rd Qu.: 3000
## Max. :137000 Max. :9.500 Max. :16.00 Max. :349000
## NA's :13 NA's :329
Viewing the top 5 rows of the dataset
head(IMDB, 5)
## color director_name num_critic_for_reviews duration
## 1 Color James Cameron 723 178
## 2 Color Gore Verbinski 302 169
## 3 Color Sam Mendes 602 148
## 4 Color Christopher Nolan 813 164
## 5 Doug Walker NA NA
## director_facebook_likes actor_3_facebook_likes actor_2_name
## 1 0 855 Joel David Moore
## 2 563 1000 Orlando Bloom
## 3 0 161 Rory Kinnear
## 4 22000 23000 Christian Bale
## 5 131 NA Rob Walker
## actor_1_facebook_likes gross genres
## 1 1000 760505847 Action|Adventure|Fantasy|Sci-Fi
## 2 40000 309404152 Action|Adventure|Fantasy
## 3 11000 200074175 Action|Adventure|Thriller
## 4 27000 448130642 Action|Thriller
## 5 131 NA Documentary
## actor_1_name movie_title
## 1 CCH Pounder AvatarÂ
## 2 Johnny Depp Pirates of the Caribbean: At World's EndÂ
## 3 Christoph Waltz SpectreÂ
## 4 Tom Hardy The Dark Knight RisesÂ
## 5 Doug Walker Star Wars: Episode VII - The Force AwakensÂ
## num_voted_users cast_total_facebook_likes actor_3_name
## 1 886204 4834 Wes Studi
## 2 471220 48350 Jack Davenport
## 3 275868 11700 Stephanie Sigman
## 4 1144337 106759 Joseph Gordon-Levitt
## 5 8 143
## facenumber_in_poster
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## plot_keywords
## 1 avatar|future|marine|native|paraplegic
## 2 goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3 bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5
## movie_imdb_link num_user_for_reviews
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1 3054
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1 1238
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1 994
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1 2701
## 5 http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1 NA
## language country content_rating budget title_year actor_2_facebook_likes
## 1 English USA PG-13 2.37e+08 2009 936
## 2 English USA PG-13 3.00e+08 2007 5000
## 3 English UK PG-13 2.45e+08 2015 393
## 4 English USA PG-13 2.50e+08 2012 23000
## 5 NA NA 12
## imdb_score aspect_ratio movie_facebook_likes
## 1 7.9 1.78 33000
## 2 7.1 2.35 0
## 3 6.8 2.35 85000
## 4 8.5 2.35 164000
## 5 7.1 NA 0
Viewing dimensions of the dataset
dim(IMDB)
## [1] 5043 28
Getting the data types of each column in the dataset
str(IMDB)
## 'data.frame': 5043 obs. of 28 variables:
## $ color : chr "Color" "Color" "Color" "Color" ...
## $ director_name : chr "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : chr "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
## $ genres : chr "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
## $ actor_1_name : chr "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
## $ movie_title : chr "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : chr "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
## $ facenumber_in_poster : int 0 0 1 0 0 1 0 1 4 3 ...
## $ plot_keywords : chr "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
## $ movie_imdb_link : chr "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : chr "English" "English" "English" "English" ...
## $ country : chr "USA" "USA" "UK" "USA" ...
## $ content_rating : chr "PG-13" "PG-13" "PG-13" "PG-13" ...
## $ budget : num 2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
## $ title_year : int 2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
Count the number of duplicated rows in the dataset
sum(duplicated(IMDB))
## [1] 45
Remove duplicate rows from the dataset
IMDB <- unique(IMDB)
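duplicated() and unique() only catch rows that match an earlier row in every column. As a minimal sketch on toy data, the example below contrasts that with a check on a key such as title plus release year. Treating those two columns as a key is an assumption, not something the source establishes.

```r
# Toy data: row 3 is an exact copy of row 1; row 4 shares its title and year
# but has a different vote count.
movies <- data.frame(
  movie_title     = c("Film A", "Film B", "Film A", "Film A"),
  title_year      = c(2010, 2012, 2010, 2010),
  num_voted_users = c(100, 250, 100, 120),
  stringsAsFactors = FALSE
)

# Exact duplicates: every column matches an earlier row.
n_exact <- sum(duplicated(movies))

# Key duplicates: same title and year, even if other columns differ.
key_cols <- c("movie_title", "title_year")   # hypothetical choice of key
n_key <- sum(duplicated(movies[, key_cols]))

# Dropping exact duplicates this way is equivalent to unique(movies).
deduped_exact <- movies[!duplicated(movies), ]
deduped_key   <- movies[!duplicated(movies[, key_cols]), ]

cat("exact duplicates:", n_exact, "| key duplicates:", n_key, "\n")
```

Whether key-level duplicates should be removed depends on which of the conflicting rows is trusted, so the sketch only counts them before dropping anything.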
Print the first 5 movie titles
head(IMDB$movie_title, 5)
## [1] "Avatar "
## [2] "Pirates of the Caribbean: At World's End "
## [3] "Spectre "
## [4] "The Dark Knight Rises "
## [5] "Star Wars: Episode VII - The Force Awakens "
Remove the stray special character ‘Â’ from the movie titles in the movie_title column
Replace the character with an empty string and trim whitespace from the right end
IMDB$movie_title <- str_replace_all(IMDB$movie_title, "Â", "")
IMDB$movie_title <- str_trim(IMDB$movie_title, side = "right")
head(IMDB$movie_title)
## [1] "Avatar"
## [2] "Pirates of the Caribbean: At World's End"
## [3] "Spectre"
## [4] "The Dark Knight Rises"
## [5] "Star Wars: Episode VII - The Force Awakens"
## [6] "John Carter"
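The stray character followed by a trailing space is consistent with a UTF-8 non-breaking space being read under a single-byte encoding. That cause is an assumption; the source does not confirm it. Under that assumption, the base R sketch below strips the debris from the end of each title only, so a legitimate ‘Â’ elsewhere in a title would survive. The toy titles are illustrative.

```r
# Toy titles carrying the assumed debris: "\u00c2" is the stray character
# and "\u00a0" is a non-breaking space.
titles <- c("Avatar\u00c2\u00a0", "Spectre\u00c2\u00a0", "John Carter")

# Remove any run of the stray character, non-breaking spaces and ordinary
# whitespace from the END of each title, leaving the rest untouched.
clean_titles <- gsub("[\u00c2\u00a0[:space:]]+$", "", titles)

print(clean_titles)
```

If the encoding assumption holds, passing fileEncoding = "UTF-8" to read.csv might avoid the artifact at load time. That has not been checked against this file.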
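Splitting the ‘genres’ column on the ‘|’ separator into one row per genre, adding a ‘genre_indicator’ column with value 1, and spreading the genres into one indicator column per genre (1 if the movie has that genre, 0 otherwise)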
IMDB <- IMDB %>%
separate_rows(genres, sep = "\\|") %>%
mutate(genre_indicator = 1) %>%
spread(genres, genre_indicator, fill = 0)
IMDB
## # A tibble: 4,998 x 53
## color director_name num_critic_for_reviews duration director_facebook_li~1
## <chr> <chr> <int> <int> <int>
## 1 "Color" James Cameron 723 178 0
## 2 "Color" Gore Verbinski 302 169 563
## 3 "Color" Sam Mendes 602 148 0
## 4 "Color" Christopher N~ 813 164 22000
## 5 "" Doug Walker NA NA 131
## 6 "Color" Andrew Stanton 462 132 475
## 7 "Color" Sam Raimi 392 156 0
## 8 "Color" Nathan Greno 324 100 15
## 9 "Color" Joss Whedon 635 141 0
## 10 "Color" David Yates 375 153 282
## # i 4,988 more rows
## # i abbreviated name: 1: director_facebook_likes
## # i 48 more variables: actor_3_facebook_likes <int>, actor_2_name <chr>,
## # actor_1_facebook_likes <int>, gross <int>, actor_1_name <chr>,
## # movie_title <chr>, num_voted_users <int>, cast_total_facebook_likes <int>,
## # actor_3_name <chr>, facenumber_in_poster <int>, plot_keywords <chr>,
## # movie_imdb_link <chr>, num_user_for_reviews <int>, language <chr>, ...
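tidyr documents spread() as superseded by pivot_wider(). As a minimal sketch on toy data, the same one-hot encoding of genres could be written as follows. The movie titles and genre strings are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data standing in for the real data frame.
movies <- data.frame(
  movie_title = c("Film A", "Film B", "Film C"),
  genres      = c("Action|Thriller", "Comedy", "Action|Comedy"),
  stringsAsFactors = FALSE
)

genre_wide <- movies %>%
  separate_rows(genres, sep = "\\|") %>%   # one row per (movie, genre) pair
  mutate(genre_indicator = 1) %>%          # mark the genre as present
  pivot_wider(
    names_from  = genres,                  # one column per genre
    values_from = genre_indicator,
    values_fill = list(genre_indicator = 0)  # absent genres become 0
  )

print(genre_wide)
```

As with spread(), the remaining columns act as the row identifier, so rows that are exact duplicates of one another need to be removed first.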
Counting the number of missing (NA) values in each column
colSums(is.na(IMDB))
## color director_name num_critic_for_reviews
## 0 0 49
## duration director_facebook_likes actor_3_facebook_likes
## 15 103 23
## actor_2_name actor_1_facebook_likes gross
## 0 7 874
## actor_1_name movie_title num_voted_users
## 0 0 0
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 0 0 13
## plot_keywords movie_imdb_link num_user_for_reviews
## 0 0 21
## language country content_rating
## 0 0 0
## budget title_year actor_2_facebook_likes
## 487 107 13
## imdb_score aspect_ratio movie_facebook_likes
## 0 327 0
## Action Adventure Animation
## 0 0 0
## Biography Comedy Crime
## 0 0 0
## Documentary Drama Family
## 0 0 0
## Fantasy Film-Noir Game-Show
## 0 0 0
## History Horror Music
## 0 0 0
## Musical Mystery News
## 0 0 0
## Reality-TV Romance Sci-Fi
## 0 0 0
## Short Sport Thriller
## 0 0 0
## War Western
## 0 0
Calculating the percentage of missing (NA) values in each column
null_percentages <- colMeans(is.na(IMDB)) * 100
print(null_percentages)
## color director_name num_critic_for_reviews
## 0.0000000 0.0000000 0.9803922
## duration director_facebook_likes actor_3_facebook_likes
## 0.3001200 2.0608243 0.4601841
## actor_2_name actor_1_facebook_likes gross
## 0.0000000 0.1400560 17.4869948
## actor_1_name movie_title num_voted_users
## 0.0000000 0.0000000 0.0000000
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 0.0000000 0.0000000 0.2601040
## plot_keywords movie_imdb_link num_user_for_reviews
## 0.0000000 0.0000000 0.4201681
## language country content_rating
## 0.0000000 0.0000000 0.0000000
## budget title_year actor_2_facebook_likes
## 9.7438976 2.1408563 0.2601040
## imdb_score aspect_ratio movie_facebook_likes
## 0.0000000 6.5426170 0.0000000
## Action Adventure Animation
## 0.0000000 0.0000000 0.0000000
## Biography Comedy Crime
## 0.0000000 0.0000000 0.0000000
## Documentary Drama Family
## 0.0000000 0.0000000 0.0000000
## Fantasy Film-Noir Game-Show
## 0.0000000 0.0000000 0.0000000
## History Horror Music
## 0.0000000 0.0000000 0.0000000
## Musical Mystery News
## 0.0000000 0.0000000 0.0000000
## Reality-TV Romance Sci-Fi
## 0.0000000 0.0000000 0.0000000
## Short Sport Thriller
## 0.0000000 0.0000000 0.0000000
## War Western
## 0.0000000 0.0000000
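is.na() reports zero missing values for the character columns, yet the printed rows show empty strings in columns such as color and language. The base R sketch below, on toy data, counts blank strings separately and then converts them to NA. The helper name is_blank is hypothetical.

```r
# Toy data: blanks in character columns, a true NA in a numeric column.
movies <- data.frame(
  color    = c("Color", "", "Color", " "),
  language = c("English", "English", "", "French"),
  duration = c(120, NA, 95, 101),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a character value is empty or whitespace.
is_blank <- function(x) {
  if (!is.character(x)) return(rep(FALSE, length(x)))
  !is.na(x) & trimws(x) == ""
}

na_counts    <- colSums(is.na(movies))                        # misses blanks
blank_counts <- sapply(movies, function(x) sum(is_blank(x)))  # counts blanks

# Convert blanks to NA so that later NA handling treats them as missing.
movies[] <- lapply(movies, function(x) { x[is_blank(x)] <- NA; x })

print(rbind(na_counts, blank_counts, after = colSums(is.na(movies))))
```

Passing na.strings = c("", "NA") to read.csv is another way to get the same effect at load time.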
Most columns have fewer than 10% missing values; the exception is gross, at about 17.5%. Rows with any NA value are dropped from the IMDB dataset, which reduces it from 4,998 to 3,768 rows
IMDB <- na.omit(IMDB)
colSums(is.na(IMDB))
## color director_name num_critic_for_reviews
## 0 0 0
## duration director_facebook_likes actor_3_facebook_likes
## 0 0 0
## actor_2_name actor_1_facebook_likes gross
## 0 0 0
## actor_1_name movie_title num_voted_users
## 0 0 0
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## 0 0 0
## plot_keywords movie_imdb_link num_user_for_reviews
## 0 0 0
## language country content_rating
## 0 0 0
## budget title_year actor_2_facebook_likes
## 0 0 0
## imdb_score aspect_ratio movie_facebook_likes
## 0 0 0
## Action Adventure Animation
## 0 0 0
## Biography Comedy Crime
## 0 0 0
## Documentary Drama Family
## 0 0 0
## Fantasy Film-Noir Game-Show
## 0 0 0
## History Horror Music
## 0 0 0
## Musical Mystery News
## 0 0 0
## Reality-TV Romance Sci-Fi
## 0 0 0
## Short Sport Thriller
## 0 0 0
## War Western
## 0 0
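na.omit() drops a row if any column is missing, so the rows lost can exceed the missing share of any single column. The sketch below, on toy data, measures that loss and compares it with dropping rows only where the columns needed for a given analysis are missing. Which columns count as needed is an assumption here.

```r
# Toy data with missing values scattered across two columns.
movies <- data.frame(
  imdb_score   = c(7.9, 7.1, 6.8, 8.5, 7.1, 6.6),
  gross        = c(760, 309, NA, 448, NA, 73),
  aspect_ratio = c(1.78, NA, 2.35, 2.35, NA, 2.35)
)

n_before <- nrow(movies)

# Drops a row if ANY column is NA.
all_complete <- na.omit(movies)

# Drops a row only if one of the columns needed for the analysis is NA.
needed <- c("imdb_score", "gross")          # hypothetical choice
needed_complete <- movies[complete.cases(movies[, needed]), ]

share_dropped <- 1 - nrow(all_complete) / n_before
cat("rows kept by na.omit:", nrow(all_complete),
    "| rows kept on needed columns:", nrow(needed_complete),
    "| share dropped by na.omit:", share_dropped, "\n")
```

Checking the share of rows dropped before committing to na.omit() makes the cost of listwise deletion explicit.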
Identifying the numeric and categorical columns
num_vars <- IMDB %>%
select_if(is.numeric) %>%
colnames()
cat_vars <- setdiff(names(IMDB), num_vars)
UNIVARIATE ANALYSIS OF THE NUMERICAL COLUMNS
Performing univariate analysis for numerical columns
for (column in num_vars) {
# Create histogram plot
hist_data <- IMDB[[column]]
hist(hist_data, main = paste("Univariate Analysis of", column), xlab = column, col = "skyblue", border = "black")
}
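After the genre split, the 0/1 genre indicators are numeric too, so the loop above also draws a histogram for each of them. As a minimal sketch on toy data, the indicator columns can be excluded, and a log10 scale applied to heavily skewed variables. The list in skewed_vars is an assumption chosen for illustration.

```r
# Toy data: a score, a heavy-tailed money column, a 0/1 indicator, a label.
movies <- data.frame(
  imdb_score  = c(7.9, 7.1, 6.8, 8.5, 6.6),
  gross       = c(760505847, 309404152, 200074175, 448130642, 73058679),
  Action      = c(1, 1, 1, 1, 0),
  movie_title = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

is_num    <- sapply(movies, is.numeric)
is_binary <- sapply(movies, function(x) is.numeric(x) && all(x %in% c(0, 1)))

# Numeric columns that are not 0/1 indicators.
continuous_vars <- names(movies)[is_num & !is_binary]

skewed_vars <- c("gross")   # assumed list of heavy-tailed columns

for (column in continuous_vars) {
  values <- movies[[column]]
  if (column %in% skewed_vars) {
    # log10(x + 1) keeps zeros finite while compressing the long right tail.
    hist(log10(values + 1), main = paste("log10 of", column),
         xlab = paste0("log10(", column, " + 1)"), col = "skyblue", border = "black")
  } else {
    hist(values, main = paste("Univariate Analysis of", column),
         xlab = column, col = "skyblue", border = "black")
  }
}
```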
UNIVARIATE ANALYSIS OF THE CATEGORICAL COLUMNS
Creating Bar graph for ‘COLOR’
bar_color <- ggplot(IMDB, aes(x = color)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Bar graph of Color") +
theme_minimal()
bar_color
Creating Bar graph for ‘director_name’
bar_director_name <- ggplot(IMDB, aes(x = director_name)) +
geom_bar(fill = "skyblue", color = "white") +
labs(title = "Bar graph of Director Name") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
bar_director_name
Creating Bar graph for ‘content_rating’
bar_content_rating <- ggplot(IMDB, aes(x = content_rating)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Bar graph of Content Rating") +
theme_minimal()
bar_content_rating
Creating bar graph for ‘actor_2_name’
bar_actor_2_name <- ggplot(IMDB, aes(x = actor_2_name)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Bar graph of Actor 2 Name") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
bar_actor_2_name
Creating bar chart for ‘movie_title’
bar_movie_title <- ggplot(IMDB, aes(x = movie_title)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Bar graph of Movie Title") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
bar_movie_title
Creating Bar graph for ‘language’
bar_language <- ggplot(IMDB, aes(x = language)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Bar graph of Language") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
bar_language
Creating Bar graph for ‘country’
bar_country <- ggplot(IMDB, aes(x = country)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Bar graph of Country") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
bar_country
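Columns such as director_name, actor_2_name and movie_title have a very large number of distinct values, so a bar for every level is hard to read. One option, sketched below on toy data, is to plot only the most frequent levels. The helper top_n_counts and the placeholder names are hypothetical.

```r
library(ggplot2)

# Hypothetical helper: the n most frequent levels of a vector, as a data frame.
top_n_counts <- function(x, n = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  top <- head(counts, n)
  data.frame(level = names(top), count = as.integer(top),
             stringsAsFactors = FALSE)
}

# Toy values standing in for a high-cardinality column.
directors <- c("Director A", "Director B", "Director A",
               "Director C", "Director B", "Director A")

top_directors <- top_n_counts(directors, n = 2)

# Horizontal bars, ordered by count, keep long names legible.
bar_top_directors <- ggplot(top_directors,
                            aes(x = reorder(level, count), y = count)) +
  geom_col(fill = "skyblue", color = "black") +
  coord_flip() +
  labs(title = "Most frequent directors", x = "director_name", y = "count") +
  theme_minimal()
bar_top_directors
```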
BIVARIATE ANALYSIS
# Scatter plot for 'color' vs 'num_critic_for_reviews' (printed at the end of this section)
scatter_color_num_critic <- ggplot(IMDB, aes(x = color, y = num_critic_for_reviews, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Num Critic for Reviews") +
theme_minimal()
# Scatter plot for 'content_rating' vs 'num_critic_for_reviews'
scatter_content_rating_num_critic <- ggplot(IMDB, aes(x = content_rating, y = num_critic_for_reviews, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs Num Critic for Reviews") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
scatter_content_rating_num_critic
# Scatter plot for 'color' vs 'duration'
scatter_color_duration <- ggplot(IMDB, aes(x = color, y = duration, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Duration") +
theme_minimal()
scatter_color_duration
# Scatter plot for 'content_rating' vs 'duration'
scatter_content_rating_duration <- ggplot(IMDB, aes(x = content_rating, y = duration, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs Duration") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
scatter_content_rating_duration
# Scatter plot for 'color' vs 'director_facebook_likes'
scatter_color_director_facebook <- ggplot(IMDB, aes(x = color, y = director_facebook_likes, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Director Facebook Likes") +
theme_minimal()
scatter_color_director_facebook
# Scatter plot for 'content_rating' vs 'director_facebook_likes'
scatter_content_rating_director_facebook <- ggplot(IMDB, aes(x = content_rating, y = director_facebook_likes, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs Director Facebook Likes") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
scatter_content_rating_director_facebook
# Scatter plot for 'color' vs 'actor_3_facebook_likes'
scatter_color_actor3_facebook_likes <- ggplot(IMDB, aes(x = color, y = actor_3_facebook_likes, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Actor 3 Facebook Likes") +
theme_minimal()
scatter_color_actor3_facebook_likes
# Scatter plot for 'content_rating' vs 'actor_3_facebook_likes'
scatter_content_rating_actor3_facebook_likes <- ggplot(IMDB, aes(x = content_rating, y = actor_3_facebook_likes, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs Actor 3 Facebook Likes") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
scatter_content_rating_actor3_facebook_likes
# Scatter plot for 'color' vs 'actor_1_facebook_likes'
scatter_color_actor1_facebook_likes <- ggplot(IMDB, aes(x = color, y = actor_1_facebook_likes, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Actor 1 Facebook Likes") +
theme_minimal()
scatter_color_actor1_facebook_likes
# Scatter plot for 'color' vs 'num_voted_users'
scatter_color_num_voted_users <- ggplot(IMDB, aes(x = color, y = num_voted_users, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Num Voted Users") +
theme_minimal()
scatter_color_num_voted_users
# Scatter plot for 'content_rating' vs 'num_voted_users'
scatter_content_rating_num_voted_users <- ggplot(IMDB, aes(x = content_rating, y = num_voted_users, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs Num Voted Users") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
scatter_content_rating_num_voted_users
# Scatter plot for 'color' vs 'facenumber_in_poster'
scatter_color_facenumber_in_poster <- ggplot(IMDB, aes(x = color, y = facenumber_in_poster, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Face Number in Poster") +
theme_minimal()
scatter_color_facenumber_in_poster
# Scatter plot for 'content_rating' vs 'facenumber_in_poster'
scatter_content_rating_facenumber_in_poster <- ggplot(IMDB, aes(x = content_rating, y = facenumber_in_poster, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs Face Number in Poster") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
scatter_content_rating_facenumber_in_poster
# Scatter plot for 'color' vs 'num_user_for_reviews'
scatter_color_num_user_for_reviews <- ggplot(IMDB, aes(x = color, y = num_user_for_reviews, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Num User for Reviews") +
theme_minimal()
scatter_color_num_user_for_reviews
# Scatter plot for 'content_rating' vs 'num_user_for_reviews'
scatter_content_rating_num_user_for_reviews <- ggplot(IMDB, aes(x = content_rating, y = num_user_for_reviews, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs Num User for Reviews") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
scatter_content_rating_num_user_for_reviews
# Scatter plot for 'color' vs 'budget'
scatter_color_budget <- ggplot(IMDB, aes(x = color, y = budget, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Budget") +
theme_minimal()
scatter_color_budget
# Scatter plot for 'content_rating' vs 'budget'
scatter_content_rating_budget <- ggplot(IMDB, aes(x = content_rating, y = budget, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs Budget") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
scatter_content_rating_budget
# Scatter plot for 'color' vs 'title_year'
scatter_color_title_year <- ggplot(IMDB, aes(x = color, y = title_year, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Title Year") +
theme_minimal()
scatter_color_title_year
# Scatter plot for 'content_rating' vs 'title_year'
scatter_content_rating_title_year <- ggplot(IMDB, aes(x = content_rating, y = title_year, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs Title Year") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
scatter_content_rating_title_year
# Scatter plot for 'color' vs 'imdb_score'
scatter_color_imdb_score <- ggplot(IMDB, aes(x = color, y = imdb_score, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs IMDB Score") +
theme_minimal()
scatter_color_imdb_score
# Scatter plot for 'content_rating' vs 'imdb_score'
scatter_content_rating_imdb_score <- ggplot(IMDB, aes(x = content_rating, y = imdb_score, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs IMDB Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
scatter_content_rating_imdb_score
# Scatter plot for 'color' vs 'aspect_ratio'
scatter_color_aspect_ratio <- ggplot(IMDB, aes(x = color, y = aspect_ratio, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Aspect Ratio") +
theme_minimal()
scatter_color_aspect_ratio
# Scatter plot for 'content_rating' vs 'aspect_ratio'
scatter_content_rating_aspect_ratio <- ggplot(IMDB, aes(x = content_rating, y = aspect_ratio, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs Aspect Ratio") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
scatter_content_rating_aspect_ratio
# Scatter plot for 'color' vs 'movie_facebook_likes'
scatter_color_movie_facebook_likes <- ggplot(IMDB, aes(x = color, y = movie_facebook_likes, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Movie Facebook Likes") +
theme_minimal()
scatter_color_movie_facebook_likes
# Scatter plot for 'content_rating' vs 'movie_facebook_likes'
scatter_content_rating_movie_facebook_likes <- ggplot(IMDB, aes(x = content_rating, y = movie_facebook_likes, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs Movie Facebook Likes") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
scatter_content_rating_movie_facebook_likes
# Scatter plot for 'color' vs 'gross'
scatter_color_gross <- ggplot(IMDB, aes(x = color, y = gross, color = color)) +
geom_point() +
labs(title = "Scatter Plot: Color vs Gross") +
theme_minimal()
scatter_color_gross
# Scatter plot for 'content_rating' vs 'gross'
scatter_content_rating_gross <- ggplot(IMDB, aes(x = content_rating, y = gross, color = content_rating)) +
geom_point() +
labs(title = "Scatter Plot: Content Rating vs Gross") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
scatter_content_rating_gross
scatter_color_num_critic
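The plots in this section all pair a categorical x-axis with a numeric y-axis and differ only in the column names. The sketch below generates them from a single helper and uses box plots instead of scatter plots. That is a swapped-in technique, not the approach used above; it is chosen because box plots summarise each group's distribution where points would overlap. The helper name and the toy data are hypothetical.

```r
library(ggplot2)

# Hypothetical helper: box plot of a numeric column grouped by a categorical one.
box_by_category <- function(data, cat_col, num_col) {
  ggplot(data, aes(x = .data[[cat_col]], y = .data[[num_col]],
                   fill = .data[[cat_col]])) +
    geom_boxplot() +
    labs(title = paste("Box Plot:", cat_col, "vs", num_col),
         x = cat_col, y = num_col) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = "none")
}

# Toy data standing in for the real data frame.
movies <- data.frame(
  color          = c("Color", "Color", "Black and White", "Color", "Color", "Black and White"),
  content_rating = c("PG-13", "R", "R", "PG", "PG-13", "PG"),
  imdb_score     = c(7.9, 7.1, 6.8, 8.5, 6.6, 7.4),
  duration       = c(178, 169, 148, 164, 132, 101),
  stringsAsFactors = FALSE
)

# Every categorical/numeric pairing; the first column varies fastest.
pairs <- expand.grid(cat_col = c("color", "content_rating"),
                     num_col = c("imdb_score", "duration"),
                     stringsAsFactors = FALSE)

plots <- Map(function(cat_col, num_col) box_by_category(movies, cat_col, num_col),
             pairs$cat_col, pairs$num_col)
names(plots) <- paste(pairs$cat_col, pairs$num_col, sep = "_vs_")

plots[["content_rating_vs_imdb_score"]]
```

Replacing geom_boxplot() with geom_point() in the helper would reproduce the original scatter plots with the same loop.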
Dropping Columns
col_to_drop <- c('color', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'genres', 'actor_3_name', 'plot_keywords', 'movie_imdb_link', 'language', 'country', 'actor_2_facebook_likes', 'aspect_ratio')
Dropping the irrelevant columns and printing the head of the data frame
IMDB1 <- IMDB[, !names(IMDB) %in% col_to_drop]
head(IMDB1, 5)
## # A tibble: 5 x 40
## director_name num_critic_for_reviews gross actor_1_name movie_title
## <chr> <int> <int> <chr> <chr>
## 1 James Cameron 723 760505847 CCH Pounder Avatar
## 2 Gore Verbinski 302 309404152 Johnny Depp Pirates of~
## 3 Sam Mendes 602 200074175 Christoph Waltz Spectre
## 4 Christopher Nolan 813 448130642 Tom Hardy The Dark K~
## 5 Andrew Stanton 462 73058679 Daryl Sabara John Carter
## # i 35 more variables: num_voted_users <int>, cast_total_facebook_likes <int>,
## # facenumber_in_poster <int>, num_user_for_reviews <int>,
## # content_rating <chr>, budget <dbl>, title_year <int>, imdb_score <dbl>,
## # movie_facebook_likes <int>, Action <dbl>, Adventure <dbl>, Animation <dbl>,
## # Biography <dbl>, Comedy <dbl>, Crime <dbl>, Documentary <dbl>, Drama <dbl>,
## # Family <dbl>, Fantasy <dbl>, `Film-Noir` <dbl>, `Game-Show` <dbl>,
## # History <dbl>, Horror <dbl>, Music <dbl>, Musical <dbl>, Mystery <dbl>, ...
Reason for dropping columns:
The columns ‘color’, ‘duration’, ‘director_facebook_likes’, ‘actor_3_facebook_likes’, ‘actor_2_name’, ‘actor_1_facebook_likes’, ‘genres’, ‘actor_3_name’, ‘plot_keywords’, ‘movie_imdb_link’, ‘language’, ‘country’, ‘actor_2_facebook_likes’ and ‘aspect_ratio’ are judged not to directly influence or correlate with the IMDb rating, which makes them irrelevant for the IMDb rating analysis.
‘color’: The majority of the movies are in color, so the column carries little meaningful information.
‘duration’: The movie’s duration might not serve as a significant predictor for IMDb ratings or box office performance.
‘director_facebook_likes’: While a director’s influence can affect a movie’s success, the count of their Facebook likes may not be the most relevant metric.
‘actor_3_facebook_likes’, ‘actor_1_facebook_likes’, ‘actor_2_facebook_likes’, ‘actor_2_name’, ‘actor_3_name’: The Facebook likes of individual actors, and the names of the second- and third-billed actors, may not strongly indicate a movie’s success or quality.
‘genres’: The ‘genres’ column is valuable for genre-related analysis, but its information has already been extracted into binary indicator columns, so the original column is redundant.
‘plot_keywords’: Plot keywords tend to be highly specific and vary considerably from movie to movie.
‘movie_imdb_link’: The URL of the movie’s IMDb page carries no analytical information.
‘language’: The language does not show a considerable relationship with the target variable.
‘country’: The country does not show a considerable relationship with the target variable.
‘title_year’: The release year is assumed not to affect the ratings. Note that ‘title_year’ is not included in col_to_drop above, so it is still present in IMDB1.
‘aspect_ratio’: The aspect ratio of a movie is not expected to significantly affect IMDb ratings or box office success.
By dropping these columns, we can focus on exploring and analyzing the more relevant features that have a stronger correlation with the ‘imdb_score’ target variable.
Summary statistics for the new dataframe
summary(IMDB1)
## director_name num_critic_for_reviews gross
## Length:3768 Min. : 1.0 Min. : 162
## Class :character 1st Qu.: 75.0 1st Qu.: 7571550
## Mode :character Median :137.0 Median : 29036498
## Mean :165.5 Mean : 51869535
## 3rd Qu.:223.0 3rd Qu.: 66466858
## Max. :813.0 Max. :760505847
## actor_1_name movie_title num_voted_users
## Length:3768 Length:3768 Min. : 5
## Class :character Class :character 1st Qu.: 18769
## Mode :character Mode :character Median : 53041
## Mean : 104398
## 3rd Qu.: 126909
## Max. :1689764
## cast_total_facebook_likes facenumber_in_poster num_user_for_reviews
## Min. : 0 Min. : 0.000 Min. : 1.0
## 1st Qu.: 1862 1st Qu.: 0.000 1st Qu.: 107.0
## Median : 3965 Median : 1.000 Median : 207.0
## Mean : 11382 Mean : 1.378 Mean : 332.6
## 3rd Qu.: 16122 3rd Qu.: 2.000 3rd Qu.: 395.2
## Max. :656730 Max. :43.000 Max. :5060.0
## content_rating budget title_year imdb_score
## Length:3768 Min. :2.180e+02 Min. :1920 Min. :1.600
## Class :character 1st Qu.:1.000e+07 1st Qu.:1999 1st Qu.:5.900
## Mode :character Median :2.500e+07 Median :2005 Median :6.600
## Mean :4.585e+07 Mean :2003 Mean :6.466
## 3rd Qu.:5.000e+07 3rd Qu.:2010 3rd Qu.:7.200
## Max. :1.222e+10 Max. :2016 Max. :9.300
## movie_facebook_likes Action Adventure Animation
## Min. : 0 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 217 Median :0.0000 Median :0.0000 Median :0.00000
## Mean : 9208 Mean :0.2532 Mean :0.2059 Mean :0.05228
## 3rd Qu.: 11000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :349000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## Biography Comedy Crime Documentary
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.000
## Mean :0.06396 Mean :0.3893 Mean :0.1887 Mean :0.013
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.000
## Drama Family Fantasy Film-Noir
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000000
## Median :1.0000 Median :0.0000 Median :0.0000 Median :0.0000000
## Mean :0.5061 Mean :0.1176 Mean :0.1351 Mean :0.0002654
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000000
## Game-Show History Horror Music
## Min. :0 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0 Mean :0.03954 Mean :0.1032 Mean :0.04007
## 3rd Qu.:0 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :0 Max. :1.00000 Max. :1.0000 Max. :1.00000
## Musical Mystery News Reality-TV Romance
## Min. :0.00000 Min. :0.0000 Min. :0 Min. :0 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0 1st Qu.:0 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0 Median :0 Median :0.0000
## Mean :0.02654 Mean :0.1008 Mean :0 Mean :0 Mean :0.2282
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :0 Max. :0 Max. :1.0000
## Sci-Fi Short Sport Thriller
## Min. :0.0000 Min. :0 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0 Median :0.00000 Median :0.0000
## Mean :0.1311 Mean :0 Mean :0.03901 Mean :0.2946
## 3rd Qu.:0.0000 3rd Qu.:0 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :0 Max. :1.00000 Max. :1.0000
## War Western
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.04087 Mean :0.01566
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
The IMDb dataset contains widely varying values, and outlier treatment is not applied because outliers in budget and IMDb score, such as very high budget movies, can themselves be a key focus area.
Adding relevant columns
# Creating 'profit' column
IMDB1$profit <- IMDB1$gross - IMDB1$budget
# Creating 'return_on_investment_perc' column
IMDB1$return_on_investment_perc <- (IMDB1$profit / IMDB1$budget) * 100
Data Visualization
# Plotting Histogram for movie release year
movie_rels <- hist(IMDB1$title_year, breaks = 30, main = "Histogram of Movie Releases",
xlab = "Year movie was released", ylab = "Movie Count", col = "skyblue")
movie_rels
## $breaks
## [1] 1920 1925 1930 1935 1940 1945 1950 1955 1960 1965 1970 1975 1980 1985 1990
## [16] 1995 2000 2005 2010 2015 2020
##
## $counts
## [1] 1 2 2 5 0 5 5 3 16 11 20 37 83 143 226 629 876 879 765
## [20] 60
##
## $density
## [1] 5.307856e-05 1.061571e-04 1.061571e-04 2.653928e-04 0.000000e+00
## [6] 2.653928e-04 2.653928e-04 1.592357e-04 8.492569e-04 5.838641e-04
## [11] 1.061571e-03 1.963907e-03 4.405520e-03 7.590234e-03 1.199575e-02
## [16] 3.338641e-02 4.649682e-02 4.665605e-02 4.060510e-02 3.184713e-03
##
## $mids
## [1] 1922.5 1927.5 1932.5 1937.5 1942.5 1947.5 1952.5 1957.5 1962.5 1967.5
## [11] 1972.5 1977.5 1982.5 1987.5 1992.5 1997.5 2002.5 2007.5 2012.5 2017.5
##
## $xname
## [1] "IMDB1$title_year"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
From the graph, it can be inferred that there are few records of movies released before 1980: the bins up to 1980 hold 107 of the 3,768 movies.
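The observation about early releases can be checked directly from the bin counts printed by `hist()`. The sketch below re-enters those counts by hand (copied from the `$counts` output, not recomputed from the data) and sums the bins whose upper edge is 1980 or earlier. The variable names are illustrative.

```r
# Bin edges and counts copied from the hist() output above
breaks <- seq(1920, 2020, by = 5)
counts <- c(1, 2, 2, 5, 0, 5, 5, 3, 16, 11, 20, 37,
            83, 143, 226, 629, 876, 879, 765, 60)

# breaks[-1] holds the upper edge of each bin; keep bins ending at or before 1980
early <- breaks[-1] <= 1980
n_early <- sum(counts[early])
n_total <- sum(counts)
share_early <- n_early / n_total

cat(n_early, "of", n_total, "movies,", round(100 * share_early, 1), "%\n")
```

Because `hist()` uses right-closed bins by default, the last of these bins includes movies released in 1980 itself.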
TOP 20 MOST PROFITABLE MOVIES
#Sort IMDB1 by profit in descending order
sorted_IMDB <- IMDB1[order(IMDB1$profit, decreasing = TRUE), ]
# Select the top 20 most profitable movies
top_20 <- head(sorted_IMDB, 20)
top_20
## # A tibble: 20 x 42
## director_name num_critic_for_reviews gross actor_1_name movie_title
## <chr> <int> <int> <chr> <chr>
## 1 James Cameron 723 760505847 CCH Pounder Avatar
## 2 Colin Trevorrow 644 652177271 Bryce Dallas ~ Jurassic W~
## 3 James Cameron 315 658672302 Leonardo DiCa~ Titanic
## 4 George Lucas 282 460935665 Harrison Ford Star Wars:~
## 5 Steven Spielberg 215 434949459 Henry Thomas E.T. the E~
## 6 Joss Whedon 703 623279547 Chris Hemswor~ The Avenge~
## 7 Roger Allers 186 422783777 Matthew Brode~ The Lion K~
## 8 George Lucas 320 474544677 Natalie Portm~ Star Wars:~
## 9 Christopher Nolan 645 533316061 Christian Bale The Dark K~
## 10 Gary Ross 673 407999255 Jennifer Lawr~ The Hunger~
## 11 Tim Miller 579 363024263 Ryan Reynolds Deadpool
## 12 Francis Lawrence 502 424645577 Jennifer Lawr~ The Hunger~
## 13 Steven Spielberg 308 356784000 Wayne Knight Jurassic P~
## 14 Pierre Coffin 306 368049635 Steve Carell Despicable~
## 15 Clint Eastwood 490 350123553 Bradley Cooper American S~
## 16 Andrew Stanton 301 380838870 Alexander Gou~ Finding Ne~
## 17 Andrew Adamson 205 436471036 Rupert Everett Shrek 2
## 18 Peter Jackson 328 377019252 Orlando Bloom The Lord o~
## 19 Richard Marquand 197 309125409 Harrison Ford Star Wars:~
## 20 Robert Zemeckis 149 329691196 Tom Hanks Forrest Gu~
## # i 37 more variables: num_voted_users <int>, cast_total_facebook_likes <int>,
## # facenumber_in_poster <int>, num_user_for_reviews <int>,
## # content_rating <chr>, budget <dbl>, title_year <int>, imdb_score <dbl>,
## # movie_facebook_likes <int>, Action <dbl>, Adventure <dbl>, Animation <dbl>,
## # Biography <dbl>, Comedy <dbl>, Crime <dbl>, Documentary <dbl>, Drama <dbl>,
## # Family <dbl>, Fantasy <dbl>, `Film-Noir` <dbl>, `Game-Show` <dbl>,
## # History <dbl>, Horror <dbl>, Music <dbl>, Musical <dbl>, Mystery <dbl>, ...
Create scatter plot with regression line
ggplot(top_20, aes(x = budget / 1e6, y = profit / 1e6, label = movie_title)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_text(hjust = -0.1, vjust = -0.5, size = 3) +
labs(x = "Budget $million", y = "Profit $million", title = "Top 20 Profitable Movies") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: label.
## i This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## i Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
It can be inferred from this plot that, among the 20 most profitable movies, higher-budget movies tend to earn more profit. The trend is roughly linear, with profit increasing as budget increases.
THE 20 MOST PROFITABLE MOVIES BASED ON INVESTMENT
Sort the DataFrame by profit in descending order
sorted_IMDB1 <- IMDB1[order(IMDB1$profit, decreasing = TRUE), ]
top_20_1 <- head(sorted_IMDB1, 20)
top_20_1
## # A tibble: 20 x 42
## director_name num_critic_for_reviews gross actor_1_name movie_title
## <chr> <int> <int> <chr> <chr>
## 1 James Cameron 723 760505847 CCH Pounder Avatar
## 2 Colin Trevorrow 644 652177271 Bryce Dallas ~ Jurassic W~
## 3 James Cameron 315 658672302 Leonardo DiCa~ Titanic
## 4 George Lucas 282 460935665 Harrison Ford Star Wars:~
## 5 Steven Spielberg 215 434949459 Henry Thomas E.T. the E~
## 6 Joss Whedon 703 623279547 Chris Hemswor~ The Avenge~
## 7 Roger Allers 186 422783777 Matthew Brode~ The Lion K~
## 8 George Lucas 320 474544677 Natalie Portm~ Star Wars:~
## 9 Christopher Nolan 645 533316061 Christian Bale The Dark K~
## 10 Gary Ross 673 407999255 Jennifer Lawr~ The Hunger~
## 11 Tim Miller 579 363024263 Ryan Reynolds Deadpool
## 12 Francis Lawrence 502 424645577 Jennifer Lawr~ The Hunger~
## 13 Steven Spielberg 308 356784000 Wayne Knight Jurassic P~
## 14 Pierre Coffin 306 368049635 Steve Carell Despicable~
## 15 Clint Eastwood 490 350123553 Bradley Cooper American S~
## 16 Andrew Stanton 301 380838870 Alexander Gou~ Finding Ne~
## 17 Andrew Adamson 205 436471036 Rupert Everett Shrek 2
## 18 Peter Jackson 328 377019252 Orlando Bloom The Lord o~
## 19 Richard Marquand 197 309125409 Harrison Ford Star Wars:~
## 20 Robert Zemeckis 149 329691196 Tom Hanks Forrest Gu~
## # i 37 more variables: num_voted_users <int>, cast_total_facebook_likes <int>,
## # facenumber_in_poster <int>, num_user_for_reviews <int>,
## # content_rating <chr>, budget <dbl>, title_year <int>, imdb_score <dbl>,
## # movie_facebook_likes <int>, Action <dbl>, Adventure <dbl>, Animation <dbl>,
## # Biography <dbl>, Comedy <dbl>, Crime <dbl>, Documentary <dbl>, Drama <dbl>,
## # Family <dbl>, Fantasy <dbl>, `Film-Noir` <dbl>, `Game-Show` <dbl>,
## # History <dbl>, Horror <dbl>, Music <dbl>, Musical <dbl>, Mystery <dbl>, ...
Create scatter plot with regression line and text labels
ggplot(data = top_20_1, aes(x = budget / 1e6, y = return_on_investment_perc)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
geom_text(aes(label = movie_title), hjust = 0, vjust = 0, size = 3, nudge_y = 0.2) +
labs(x = "Budget ($million)", y = "Percent Return on Investment", title = "Top 20 Movies based on Return on Investment") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
These are the same 20 movies as in the previous section, selected by profit, now plotted by their percentage return on investment ((profit/budget)*100) against budget.
Since the absolute profit earned by a movie does not give a clear picture of its monetary success over the years, looking at the Return on Investment (ROI) against budget is more informative. Among these movies, the ROI is highest for low-budget films and decreases as the budget of the movie increases.
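The selection above orders movies by `profit`, so the plot shows ROI for the most profitable movies. If the goal is instead a ranking by ROI itself, a minimal sketch, assuming the `return_on_investment_perc` definition used earlier, would sort on that column. The toy data frame, its titles and its figures are hypothetical and are not taken from the IMDb dataset.

```r
# Toy data with hypothetical titles and figures (not from the IMDb dataset)
movies <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  budget = c(200e6, 50e6, 1e6, 10e6),
  gross  = c(700e6, 120e6, 30e6, 15e6),
  stringsAsFactors = FALSE
)

# Same derived columns as in the report
movies$profit <- movies$gross - movies$budget
movies$return_on_investment_perc <- (movies$profit / movies$budget) * 100

# Ranking by absolute profit versus ranking by ROI
by_profit <- movies[order(movies$profit, decreasing = TRUE), ]
by_roi    <- movies[order(movies$return_on_investment_perc, decreasing = TRUE), ]

by_profit$movie_title
by_roi$movie_title
```

In this toy example the low-budget movie "C" is third by profit but first by ROI, which is the kind of difference a profit-based selection cannot show.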
TOP 20 DIRECTORS WITH HIGHEST AVERAGE IMDB RATING
director_avg_imdb <- aggregate(imdb_score ~ director_name, data = IMDB1, FUN = mean) %>% arrange(desc(imdb_score))
head(director_avg_imdb,10)
## director_name imdb_score
## 1 Charles Chaplin 8.600000
## 2 Tony Kaye 8.600000
## 3 Alfred Hitchcock 8.500000
## 4 Damien Chazelle 8.500000
## 5 Majid Majidi 8.500000
## 6 Ron Fricke 8.500000
## 7 Sergio Leone 8.433333
## 8 Christopher Nolan 8.425000
## 9 Asghar Farhadi 8.400000
## 10 Richard Marquand 8.400000
Selecting the top 20 directors with the highest average IMDb scores
top_20_directors <- head(director_avg_imdb, 20)
top_20_directors
## director_name imdb_score
## 1 Charles Chaplin 8.600000
## 2 Tony Kaye 8.600000
## 3 Alfred Hitchcock 8.500000
## 4 Damien Chazelle 8.500000
## 5 Majid Majidi 8.500000
## 6 Ron Fricke 8.500000
## 7 Sergio Leone 8.433333
## 8 Christopher Nolan 8.425000
## 9 Asghar Farhadi 8.400000
## 10 Richard Marquand 8.400000
## 11 S.S. Rajamouli 8.400000
## 12 Billy Wilder 8.300000
## 13 Charles Ferguson 8.300000
## 14 Fritz Lang 8.300000
## 15 Lee Unkrich 8.300000
## 16 Lenny Abrahamson 8.300000
## 17 Pete Docter 8.233333
## 18 Hayao Miyazaki 8.225000
## 19 Elia Kazan 8.200000
## 20 George Roy Hill 8.200000
Creating a horizontal bar plot for the top directors
ggplot(top_20_directors, aes(x = imdb_score, y = reorder(director_name, imdb_score))) +
geom_bar(stat = "identity", fill = "orange") +
labs(x = "Average IMDb Score", y = "Director Name", title = "Top 20 Directors with Highest Average IMDb Scores") +
theme_minimal()
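An average taken over a single film is a noisy measure of a director, and `aggregate()` with `FUN = mean` does not show how many films each average is based on. One possible refinement, sketched below on hypothetical toy data, is to compute the film count alongside the mean and rank only directors with at least `min_films` movies. The threshold and all names in the sketch are assumptions for illustration.

```r
# Toy ratings with hypothetical directors (not from the IMDb dataset)
ratings <- data.frame(
  director_name = c("Dir A", "Dir B", "Dir B", "Dir B", "Dir C", "Dir C"),
  imdb_score = c(8.6, 8.0, 8.4, 8.2, 7.0, 7.4),
  stringsAsFactors = FALSE
)

# Mean score and number of films per director
avg <- aggregate(imdb_score ~ director_name, data = ratings, FUN = mean)
n   <- aggregate(imdb_score ~ director_name, data = ratings, FUN = length)
names(n)[2] <- "n_films"
director_stats <- merge(avg, n, by = "director_name")

# Keep only directors with at least min_films movies, then rank by mean score
min_films <- 2
ranked <- director_stats[director_stats$n_films >= min_films, ]
ranked <- ranked[order(ranked$imdb_score, decreasing = TRUE), ]
ranked
```

Here "Dir A" has the highest single score but is excluded because it rests on one film only.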
TOP DIRECTORS BY TOTAL PROFIT
Calculating the total profit for each director
director_profit <- aggregate(profit ~ director_name, data = IMDB1, FUN = sum)%>%arrange(desc(profit))
Selecting the top 10 directors
top_directors <- head(director_profit, 10)
top_directors
## director_name profit
## 1 Steven Spielberg 2486332231
## 2 George Lucas 1386641480
## 3 James Cameron 1199625910
## 4 Chris Columbus 941707624
## 5 Tim Burton 824275480
## 6 Christopher Nolan 808227576
## 7 Peter Jackson 777968050
## 8 Jon Favreau 769381547
## 9 Francis Lawrence 755501971
## 10 Michael Bay 644242537
Creating a bar plot
ggplot(top_directors, aes(x = director_name, y = profit)) +
geom_bar(stat = "identity", fill = "royalblue") +
labs(x = "Director Name", y = "Total Profit", title = "Top Directors by Total Profit") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = scales::comma)
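Thus it can be inferred that top directors can raise the prospect of profit but do not guarantee high ratings: of the ten directors listed here, only Christopher Nolan also appears among the top 20 by average IMDb score.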
SCATTER PLOT OF MOVIE FACEBOOK LIKES VS IMDB SCORE
ggplot(IMDB, aes(x = movie_facebook_likes, y = imdb_score, color = content_rating)) +
geom_point(alpha = 0.7) +
labs(x = "Movie Facebook Likes", y = "IMDb Score", title = "Scatter Plot of IMDb Score vs. Movie Facebook Likes") +
theme_minimal()
Movies with extremely high Facebook likes tend to have higher IMDb scores, but the scores of movies with few Facebook likes vary over a very wide range.
COMPARISON BETWEEN NUM_CRITICS AND MOVIES_FACEBOOK_LIKES
ggplot(IMDB, aes(x = num_critic_for_reviews, y = movie_facebook_likes)) +
geom_point(alpha = 0.5) +
labs(x = "Number of Critic Reviews", y = "Movie Facebook Likes", title = "Comparison of Number of Critic Reviews vs. Movie Facebook Likes") +
scale_y_continuous(labels = scales::comma) +
theme_minimal()
The plot suggests that movies reviewed by more critics tend to receive more Facebook likes. One interpretation is that critic coverage helps shape public opinion, although the plot alone cannot establish this.
HEATMAP GIVING CORRELATION BETWEEN DIFFERENT COLUMNS IN THE IMDB DATA
IMDB2 <- IMDB1[, c('num_critic_for_reviews', 'cast_total_facebook_likes', 'num_user_for_reviews',
'title_year', 'movie_facebook_likes', 'num_voted_users', 'facenumber_in_poster', 'budget')]
corr <- cor(IMDB2)
corrplot(corr, method = "color", type = "full", tl.col = "black", tl.srt = 45,
col = colorRampPalette(c("navy", "white", "firebrick3"))(100),
title = "Correlation Heatmap")
Based on the heatmap, we can see some high correlations (greater than 0.7) between predictors. The highest correlation value, 0.95, is between actor_1_facebook_likes and cast_total_facebook_likes (actor_1_facebook_likes was dropped earlier and is not among the columns of IMDB2, so this pair does not appear in the heatmap plotted here). There are also high correlations among num_voted_users, num_user_for_reviews and num_critic_for_reviews. We want to keep num_voted_users and take the ratio of num_user_for_reviews to num_critic_for_reviews.
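Reading pairs above a threshold off a heatmap by eye is error-prone, so the pairs can also be listed programmatically. The following is a minimal sketch on simulated data. The column names are borrowed from the report, but the values are random and the 0.7 cut-off is simply the one mentioned in the text.

```r
set.seed(1)
n <- 200
votes <- rnorm(n)

# Simulated columns: the second is built to track the first closely
toy <- data.frame(
  num_voted_users      = votes,
  num_user_for_reviews = votes + rnorm(n, sd = 0.3),
  facenumber_in_poster = rnorm(n)
)

corr <- cor(toy)

# Look only at the upper triangle so each pair is listed once
idx <- which(abs(corr) > 0.7 & upper.tri(corr), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx],
  stringsAsFactors = FALSE
)
high_pairs
```

The same pattern applied to `cor(IMDB2)` would list the predictor pairs that exceed the threshold in the real data.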
PROFIT VS IMDB SCORE
ggplot(IMDB1, aes(x = profit, y = imdb_score)) +
geom_point(alpha = 0.5) +
labs(x = "Profit", y = "IMDb Score", title = "Scatter Plot of Profit vs. IMDb Score") +
theme_minimal()
It can be inferred that highly rated movies have a higher chance of encountering losses. The prospect of loss is higher for high-budget movies with an IMDb score above 6. From the heatmap it can be seen that num_critic_for_reviews and movie_facebook_likes have a high degree of correlation, so we analyse them further.
CRITIC REVIEW VS IMDB SCORE
ggplot(IMDB, aes(x = num_critic_for_reviews, y = imdb_score)) +
geom_point(alpha = 0.5) +
labs(x = "Number of Critic Reviews", y = "IMDb Score", title = "Scatter Plot of Number of Critic Reviews vs. IMDb Score") +
theme_minimal()
It can be inferred that the two are positively correlated: as the number of critic reviews rises, the IMDb score tends to be higher.
One Hot Encoding
Since the IMDB dataset has already been imported and normalised above, there is no need to import it again.
OHE_IMDB <- IMDB
Identifying the Categorical Variables
categorical_vars <- c("color", "director_name", "actor_1_name", "actor_2_name", "actor_3_name", "language", "country", "content_rating")
Performing the encoding. Note that converting each column to a factor and then to a number assigns one integer code per category (label encoding) rather than creating one 0/1 column per category.
encoded_data <- OHE_IMDB %>%
select(all_of(categorical_vars)) %>%
# across() replaces the deprecated funs() interface and gives the same result
mutate(across(everything(), ~ as.numeric(as.factor(.x)))) %>%
as.data.frame()
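For comparison, the difference between integer (label) encoding and one-hot encoding can be seen on a small example. The sketch below uses base R's `model.matrix()` as one common way to build indicator columns. The toy values are hypothetical, and this is not necessarily the encoding the authors intended.

```r
# Toy categorical data (hypothetical rows)
toy <- data.frame(
  content_rating = c("PG-13", "R", "PG", "R"),
  color = c("Color", "Color", "Black and White", "Color"),
  stringsAsFactors = TRUE
)

# Integer (label) encoding: one column per variable, integer code per level
label_encoded <- data.frame(lapply(toy, as.integer))

# One-hot encoding: one 0/1 column per level of content_rating
# (the "- 1" drops the intercept so every level gets its own column)
one_hot <- model.matrix(~ content_rating - 1, data = toy)

label_encoded
one_hot
```

Label encoding imposes an arbitrary numeric order on the categories, which matters for methods such as correlation and PCA that treat the codes as quantities.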
Principal Component Analysis
library('corrr')
## Warning: package 'corrr' was built under R version 4.1.3
library('ggcorrplot')
library("FactoMineR")
## Warning: package 'FactoMineR' was built under R version 4.1.3
library('caret')
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: lattice
library('factoextra')
## Warning: package 'factoextra' was built under R version 4.1.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Removing Null Values
imdb_clean <- na.omit(encoded_data)
dmy <- dummyVars(" ~.", data = imdb_clean)
trsf <- data.frame(predict(dmy, newdata = imdb_clean))
head(trsf,10)
## color director_name actor_1_name actor_2_name actor_3_name language country
## 1 3 634 220 1024 2569 12 45
## 2 3 549 699 1621 1015 12 45
## 3 3 1426 263 1826 2332 12 44
## 4 3 258 1366 393 1287 12 45
## 5 3 67 327 1871 1998 12 45
## 6 3 1429 564 900 1434 12 45
## 7 3 1149 163 592 1565 12 45
## 8 3 856 249 1789 2244 12 45
## 9 3 372 24 481 2187 12 44
## 10 3 1691 530 1246 35 12 45
## content_rating
## 1 10
## 2 10
## 3 10
## 4 10
## 5 10
## 6 10
## 7 9
## 8 10
## 9 9
## 10 10
colnames(imdb_clean)
## [1] "color" "director_name" "actor_1_name" "actor_2_name"
## [5] "actor_3_name" "language" "country" "content_rating"
numerical_data <- imdb_clean[,1:8]
head(numerical_data)
## color director_name actor_1_name actor_2_name actor_3_name language country
## 1 3 634 220 1024 2569 12 45
## 2 3 549 699 1621 1015 12 45
## 3 3 1426 263 1826 2332 12 44
## 4 3 258 1366 393 1287 12 45
## 5 3 67 327 1871 1998 12 45
## 6 3 1429 564 900 1434 12 45
## content_rating
## 1 10
## 2 10
## 3 10
## 4 10
## 5 10
## 6 10
Normalizing the data
data_normalized <- scale(numerical_data)
head(data_normalized)
## color director_name actor_1_name actor_2_name actor_3_name language
## 1 0.187521 -0.4885207 -1.22641097 -0.1326478 1.63987050 -0.1445761
## 2 0.187521 -0.6602714 -0.07138804 0.7872096 -0.40001098 -0.1445761
## 3 0.187521 1.1117923 -1.12272415 1.1030735 1.32876889 -0.1445761
## 4 0.187521 -1.2482652 1.53696330 -1.1048924 -0.04296609 -0.1445761
## 5 0.187521 -1.6341993 -0.96839958 1.1724095 0.89033876 -0.1445761
## 6 0.187521 1.1178541 -0.39691642 -0.3237070 0.14999567 -0.1445761
## country content_rating
## 1 0.3613929 -0.008526776
## 2 0.3613929 -0.008526776
## 3 0.2608147 -0.008526776
## 4 0.3613929 -0.008526776
## 5 0.3613929 -0.008526776
## 6 0.3613929 -0.008526776
Correlation matrix for all the components
corr_matrix <- cor(data_normalized)
ggcorrplot(corr_matrix)
Forming the principal components
data.pca <- princomp(corr_matrix)
summary(data.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.4281121 0.3765415 0.3623430 0.3444055 0.3410672
## Proportion of Variance 0.2055552 0.1590154 0.1472493 0.1330313 0.1304648
## Cumulative Proportion 0.2055552 0.3645706 0.5118199 0.6448512 0.7753160
## Comp.6 Comp.7 Comp.8
## Standard deviation 0.3372233 0.29430640 1.580507e-08
## Proportion of Variance 0.1275407 0.09714332 2.801601e-16
## Cumulative Proportion 0.9028567 1.00000000 1.000000e+00
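As written, `princomp(corr_matrix)` treats the eight rows of the correlation matrix as eight observations, which is consistent with the near-zero eighth component in the summary above. If the intent is a PCA of the movies themselves, two equivalent routes in base R are to pass the data with `cor = TRUE` or to pass the correlation matrix through the `covmat` argument. The sketch below shows both on simulated data. The data and variable names are assumptions for illustration only.

```r
set.seed(42)
n <- 100
x1 <- rnorm(n)

# Simulated observations: x2 is nearly a copy of x1, x3 and x4 are independent
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.2),
  x3 = rnorm(n),
  x4 = rnorm(n)
)

# Route 1: PCA on the observations, using their correlation matrix
pca_rows <- princomp(toy, cor = TRUE)

# Route 2: supply the correlation matrix explicitly via covmat
pca_cov <- princomp(covmat = cor(toy))

# Both routes give the same component standard deviations
pca_rows$sdev
pca_cov$sdev

# Proportion of variance explained by each component
prop_var <- pca_rows$sdev^2 / sum(pca_rows$sdev^2)
prop_var
```

With standardized variables the component variances sum to the number of variables, so the proportions are directly comparable across components.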
Scree Plot
fviz_eig(data.pca, addlabels = TRUE)
Biplot of the attributes
# Graph of the variables
fviz_pca_var(data.pca, col.var = "black")
Contribution of each variable
fviz_cos2(data.pca, choice = "var", axes = 1:2)
Biplot combined with cos2
fviz_pca_var(data.pca, col.var = "cos2",
gradient.cols = c("black", "orange", "green"),
repel = TRUE)
Time Series Analysis
Loading Relevant Libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.2.1 v purrr 0.3.4
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x plotly::filter() masks dplyr::filter(), stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
library(readr)
library(dplyr)
library(snakecase)
## Warning: package 'snakecase' was built under R version 4.1.3
Loading the data set
Superstore <- read_csv("C:\\Users\\chinmay.Jain\\Desktop\\R\\Superstore.csv")
## Rows: 9800 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (15): Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer ...
## dbl (3): Row ID, Postal Code, Sales
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Extracting the first 6 entries
head(Superstore)
## # A tibble: 6 x 18
## `Row ID` `Order ID` `Order Date` `Ship Date` `Ship Mode` `Customer ID`
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 541 CA-2015-140795 1/2/2015 3/2/2015 First Class BD-11500
## 2 158 CA-2015-104269 1/3/2015 6/3/2015 Second Class DB-13060
## 3 5714 US-2015-143707 1/3/2015 5/3/2015 Standard Class HR-14770
## 4 6548 CA-2015-113880 1/3/2015 5/3/2015 Standard Class VF-21715
## 5 6549 CA-2015-113880 1/3/2015 5/3/2015 Standard Class VF-21715
## 6 7948 CA-2015-131009 1/3/2015 5/3/2015 Standard Class SC-20380
## # i 12 more variables: `Customer Name` <chr>, Segment <chr>, Country <chr>,
## # City <chr>, State <chr>, `Postal Code` <dbl>, Region <chr>,
## # `Product ID` <chr>, Category <chr>, `Sub-Category` <chr>,
## # `Product Name` <chr>, Sales <dbl>
Converting the headers to snake case
colnames(Superstore)<-to_snake_case(colnames(Superstore))
colnames(Superstore)
## [1] "row_id" "order_id" "order_date" "ship_date"
## [5] "ship_mode" "customer_id" "customer_name" "segment"
## [9] "country" "city" "state" "postal_code"
## [13] "region" "product_id" "category" "sub_category"
## [17] "product_name" "sales"
Converting ‘order_date’ column to proper date format
Superstore$order_date <- as.Date(Superstore$order_date, format = "%d/%m/%Y")
Converting ‘ship_date’ column to proper date format
Superstore$ship_date <- as.Date(Superstore$ship_date, format = "%d/%m/%Y")
head(Superstore)
## # A tibble: 6 x 18
## row_id order_id order_date ship_date ship_mode customer_id customer_name
## <dbl> <chr> <date> <date> <chr> <chr> <chr>
## 1 541 CA-2015-1407~ 2015-02-01 2015-02-03 First Cl~ BD-11500 Bradley Druc~
## 2 158 CA-2015-1042~ 2015-03-01 2015-03-06 Second C~ DB-13060 Dave Brooks
## 3 5714 US-2015-1437~ 2015-03-01 2015-03-05 Standard~ HR-14770 Hallie Redmo~
## 4 6548 CA-2015-1138~ 2015-03-01 2015-03-05 Standard~ VF-21715 Vicky Freyma~
## 5 6549 CA-2015-1138~ 2015-03-01 2015-03-05 Standard~ VF-21715 Vicky Freyma~
## 6 7948 CA-2015-1310~ 2015-03-01 2015-03-05 Standard~ SC-20380 Shahid Colli~
## # i 11 more variables: segment <chr>, country <chr>, city <chr>, state <chr>,
## # postal_code <dbl>, region <chr>, product_id <chr>, category <chr>,
## # sub_category <chr>, product_name <chr>, sales <dbl>
Grouping by the product category and order_date
base <- Superstore %>%
group_by(order_date,category) %>%
summarise(sales = sum(sales))
## `summarise()` has grouped output by 'order_date'. You can override using the
## `.groups` argument.
head(base)
## # A tibble: 6 x 3
## # Groups: order_date [4]
## order_date category sales
## <date> <chr> <dbl>
## 1 2015-01-03 Office Supplies 16.4
## 2 2015-01-04 Office Supplies 288.
## 3 2015-01-05 Office Supplies 19.5
## 4 2015-01-06 Furniture 2574.
## 5 2015-01-06 Office Supplies 685.
## 6 2015-01-06 Technology 1148.
Extracting year, month, and quarter from the order_date column
base$Year <- format(base$order_date, "%Y")
base$Month <- format(base$order_date, "%m")
base$Quarter <- quarters(base$order_date)
Aggregating sales at yearly level product category-wise
yearly_sales <- aggregate(sales ~ Year + category, data = base, sum)
Creating time series object for each year
salests_yearly <- ts(yearly_sales$sales, start = c(min(base$Year)))
head(salests_yearly)
## [1] 156477.9 164053.9 195813.0 212313.8 149512.8 133124.4
Aggregating sales for each quarter of each year product category-wise
quarterly_sales <- aggregate(sales ~ Year + Quarter + category, data = base, sum)
Converting to ts object for quarterly sales of each year
salests_quarterly <- ts(quarterly_sales$sales, frequency = 4)
head(salests_quarterly)
## [1] 22300.30 23596.95 23820.96 23597.98 28002.21 27391.62
Aggregating sales at monthly level product category-wise
monthly_sales <- aggregate(sales ~ Year + Month + category, data = base, sum)
Creating time series object for 12 months of each year
salests_monthly <- ts(monthly_sales$sales, frequency = 12, start = c(2015, 1))
head(salests_monthly)
## [1] 6217.277 11739.942 7622.743 5930.162 1839.658 3134.374
Creating time series object for sales for each day
salests_daily <- ts(base$sales)
head(salests_daily)
## [1] 16.448 288.060 19.536 2573.820 685.340 1147.940
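Because the yearly, quarterly and monthly aggregates are grouped by `category` as well as by period, passing their `sales` column straight to `ts()` places the values for all categories in a single series. If separate or combined series are wanted instead, a minimal sketch on hypothetical toy data could look like the following. It assumes every month is present for every category, since missing months would need to be filled before calling `ts()`.

```r
# Toy monthly sales with hypothetical values (not from Superstore.csv)
toy <- expand.grid(
  Month = 1:12,
  Year = 2015:2016,
  category = c("Furniture", "Technology"),
  stringsAsFactors = FALSE
)
toy$sales <- seq_len(nrow(toy)) * 10

# Option 1: total sales per month, summed over categories
total <- aggregate(sales ~ Month + Year, data = toy, FUN = sum)
total <- total[order(total$Year, total$Month), ]
ts_total <- ts(total$sales, frequency = 12, start = c(2015, 1))

# Option 2: one monthly series per category
by_cat <- split(toy, toy$category)
ts_by_cat <- lapply(by_cat, function(d) {
  d <- d[order(d$Year, d$Month), ]
  ts(d$sales, frequency = 12, start = c(2015, 1))
})

length(ts_total)
sapply(ts_by_cat, length)
```

Each resulting series then covers 24 consecutive months, so `frequency = 12` lines up with the calendar.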
Plotting graphs for the daily, monthly, quarterly and yearly sales
plot.ts(salests_daily)
plot.ts(salests_monthly)
plot.ts(salests_quarterly)
plot.ts(salests_yearly)
Transforming to log time series
logdaily<-log(salests_daily)
logmonthly<-log(salests_monthly)
logquarterly<-log(salests_quarterly)
logyearly<-log(salests_yearly)
Plotting graphs for the time series transformed using log
plot.ts(logdaily)
plot.ts(logmonthly)
plot.ts(logquarterly)
plot.ts(logyearly)
Decomposing Time Series
library("TTR")
## Warning: package 'TTR' was built under R version 4.1.3
Simple Moving Average
plot.ts(SMA(salests_daily))
plot.ts(SMA(salests_monthly))
plot.ts(SMA(salests_quarterly))
plot.ts(SMA(salests_yearly))
Decomposing Seasonal Data
salests_monthlyd<- decompose(salests_monthly)
salests_quarterlyd<- decompose(salests_quarterly)
plot(salests_monthlyd)
plot(salests_quarterlyd)
ARIMA Models
Daily Sales
dailydiff1 <- diff(salests_daily, differences=1)
plot.ts(dailydiff1)
dailydiff2 <- diff(salests_daily, differences=2)
plot.ts(dailydiff2)
Monthly Sales
monthlydiff1 <- diff(salests_monthly, differences=1)
plot.ts(monthlydiff1)
monthlydiff2 <- diff(salests_monthly, differences=2)
plot.ts(monthlydiff2)
Quarterly Sales
quaterlydiff1 <- diff(salests_quarterly, differences=1)
plot.ts(quaterlydiff1)
quaterlydiff2 <- diff(salests_quarterly, differences=2)
plot.ts(quaterlydiff2)
Yearly Sales
yearlydiff1 <- diff(salests_yearly, differences=1)
plot.ts(yearlydiff1)
yearlydiff2 <- diff(salests_yearly, differences=2)
plot.ts(yearlydiff2)
Selecting a Candidate ARIMA Model
Daily
acf(dailydiff1, lag.max=20)
acf(dailydiff1, lag.max=20, plot=FALSE)
##
## Autocorrelations of series 'dailydiff1', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 -0.491 -0.005 -0.010 0.001 0.006 0.007 0.013 -0.034 -0.005 0.048
## 11 12 13 14 15 16 17 18 19 20
## -0.038 0.016 -0.019 0.005 0.007 -0.005 0.028 -0.041 0.018 0.005
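A lag-1 autocorrelation close to -0.5 followed by values near zero is the pattern produced by differencing a series that was already close to uncorrelated noise. The simulation below illustrates that property. Whether it applies to the daily sales series is an assumption offered as one possible reading, not a conclusion of the analysis above.

```r
set.seed(123)

# White noise: no autocorrelation before differencing
noise <- rnorm(5000)
d1 <- diff(noise, differences = 1)

# Lag-1 autocorrelation of the differenced series (theoretical value is -0.5)
acf_d1 <- acf(d1, lag.max = 5, plot = FALSE)
lag1 <- acf_d1$acf[2]   # element 1 is lag 0
round(lag1, 3)
```

Comparing the ACF of the undifferenced series with that of the differenced one is a simple way to judge whether differencing was needed at all.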
Monthly
acf(monthlydiff1, lag.max=20)
acf(monthlydiff1, lag.max=20, plot=FALSE)
##
## Autocorrelations of series 'monthlydiff1', by lag
##
## 0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333
## 1.000 -0.332 -0.190 0.064 0.043 0.075 -0.250 0.041 0.137 -0.068 -0.115
## 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
## 0.031 0.204 -0.058 -0.112 -0.041 0.234 -0.095 -0.154 0.029 0.050
Quarterly
acf(quaterlydiff1, lag.max=20)
acf(quaterlydiff1, lag.max=20, plot=FALSE)
##
## Autocorrelations of series 'quaterlydiff1', by lag
##
## 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50
## 1.000 -0.126 -0.429 0.089 0.278 -0.059 -0.247 0.040 0.142 0.037 -0.219
## 2.75 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75 5.00
## -0.091 0.202 0.030 -0.233 -0.127 0.350 0.099 -0.204 -0.057 0.197
Yearly
acf(yearlydiff2, lag.max=20)
acf(yearlydiff2, lag.max=20, plot=FALSE)
##
## Autocorrelations of series 'yearlydiff2', by lag
##
## 0 1 2 3 4 5 6 7 8 9
## 1.000 -0.113 -0.661 -0.017 0.486 0.005 -0.265 0.029 0.045 -0.009
ARIMA
library(forecast)
## Warning: package 'forecast' was built under R version 4.1.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
daily_a <- auto.arima(salests_daily)
monthly_a<-auto.arima(salests_monthly)
quaterly_a<-auto.arima(salests_quarterly)
yearly_a<-auto.arima(salests_yearly)
daily_a
## Series: salests_daily
## ARIMA(1,1,2)
##
## Coefficients:
## ar1 ma1 ma2
## 0.5058 -1.4638 0.4698
## s.e. 0.4697 0.4713 0.4658
##
## sigma^2 = 1655252: log likelihood = -24346.61
## AIC=48701.21 AICc=48701.22 BIC=48725.01
monthly_a
## Series: salests_monthly
## ARIMA(0,1,2)(1,0,2)[12]
##
## Coefficients:
## ma1 ma2 sar1 sma1 sma2
## -0.5652 -0.1799 -0.8527 1.1133 0.3979
## s.e. 0.0874 0.0922 0.1364 0.1560 0.0965
##
## sigma^2 = 57131580: log likelihood = -1480.7
## AIC=2973.41 AICc=2974.03 BIC=2991.19
quaterly_a
## Series: salests_quarterly
## ARIMA(0,1,0)
##
## sigma^2 = 347299618: log likelihood = -528.83
## AIC=1059.67 AICc=1059.76 BIC=1061.52
yearly_a
## Series: salests_yearly
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 188461.40
## s.e. 11217.56
##
## sigma^2 = 1.647e+09: log likelihood = -143.84
## AIC=291.68 AICc=293.01 BIC=292.65
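The models above are fitted but not yet used for forecasting. As a dependency-free illustration, the sketch below fits a fixed-order model with base R's `arima()` and forecasts with `predict()`. The simulated series and the order `c(0, 1, 1)` are arbitrary choices for the example and are not the models selected by `auto.arima()`.

```r
set.seed(7)

# Simulated random walk with drift standing in for a sales series (hypothetical data)
y <- ts(cumsum(rnorm(120, mean = 0.5)), frequency = 12, start = c(2015, 1))

# Fit a fixed-order model with base R; order = c(p, d, q)
fit <- arima(y, order = c(0, 1, 1))

# Forecast the next 6 periods together with their standard errors
fc <- predict(fit, n.ahead = 6)
fc$pred
fc$se
```

The standard errors grow with the horizon, which is expected for a model with one order of differencing.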